Discretization Algorithm that Uses Class-Attribute Interdependence Maximization

Authors

  • Lukasz Kurgan
  • Krzysztof J. Cios
Abstract

Most existing machine learning algorithms can extract knowledge from databases that store discrete attributes (features). If the attributes are continuous, the algorithms can be combined with a discretization algorithm that transforms them into discrete attributes. The paper describes an algorithm, called CAIM (class-attribute interdependence maximization), for discretizing continuous attributes; it is designed to work with supervised learning algorithms. The algorithm maximizes the class-attribute interdependence and, at the same time, generates a possibly minimal number of discrete intervals. A major advantage is that it does not require the user to predefine the number of intervals, in contrast to many existing discretization algorithms. The CAIM algorithm and five other state-of-the-art discretization algorithms were tested on well-known machine learning datasets consisting of continuous and mixed-mode attributes. The tests show that the proposed algorithm almost always generates discrete attributes with the highest class-attribute interdependency among the compared algorithms, and that it always generates the lowest number of intervals. The discretized datasets were then used with the CLIP4 machine learning algorithm. The accuracy of the rules generated by CLIP4 shows that the proposed algorithm significantly improves classification performance; it also performs best among the six discretization algorithms compared. The speed of the CAIM algorithm is comparable to that of the simplest unsupervised algorithms, and it is faster than the other supervised discretization algorithms.
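The abstract does not state the criterion or the search procedure, so the following is only a minimal sketch of how a CAIM-style discretizer might look. It assumes the criterion as it is commonly stated for CAIM, namely the average over intervals of max_r^2 / M_r, where max_r is the count of the majority class in interval r and M_r is the number of examples in that interval, and a greedy top-down search that keeps adding the best boundary while the criterion improves or while there are fewer intervals than classes. The function names `caim_value` and `caim_discretize`, and the choice of midpoints as candidate boundaries, are assumptions made for illustration.

```python
from collections import Counter


def caim_value(values, labels, boundaries):
    """Assumed CAIM criterion for sorted boundaries [d0, d1, ..., dn]."""
    n_intervals = len(boundaries) - 1
    total = 0.0
    for r in range(n_intervals):
        lo, hi = boundaries[r], boundaries[r + 1]
        # First interval is [d0, d1]; later ones are (d_{r}, d_{r+1}].
        counts = Counter(
            y for x, y in zip(values, labels)
            if (lo <= x <= hi if r == 0 else lo < x <= hi)
        )
        m_r = sum(counts.values())
        if m_r:
            total += max(counts.values()) ** 2 / m_r
    return total / n_intervals


def caim_discretize(values, labels):
    """Greedy top-down search for boundaries (illustrative sketch only)."""
    distinct = sorted(set(values))
    # Candidate boundaries: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scheme = [distinct[0], distinct[-1]]
    n_classes = len(set(labels))
    best_so_far = 0.0
    k = 1  # current number of intervals
    while candidates:
        # Score every remaining candidate when tentatively added.
        best, cut = max(
            (caim_value(values, labels, sorted(scheme + [c])), c)
            for c in candidates
        )
        # Accept if the criterion improves, or if there are still
        # fewer intervals than classes; otherwise stop.
        if best > best_so_far or k < n_classes:
            scheme = sorted(scheme + [cut])
            candidates.remove(cut)
            best_so_far = best
            k += 1
        else:
            break
    return scheme


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(caim_discretize(values, labels))
```

Note that the number of intervals is never passed in: the stopping rule decides it, which mirrors the advantage the abstract highlights. On the toy data above, the search stops after a single cut between the two class clusters.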


Similar resources

Fast Class-Attribute Interdependence Maximization (CAIM) Discretization Algorithm

Discretization is the process of converting a continuous attribute into an attribute that contains a small number of distinct values. One of the major reasons for discretizing an attribute is that some machine learning algorithms perform poorly with continuous attributes and thus require front-end discretization of the input data. The paper describes a Fast Class-Attribute Interdependence Max...


A Novel Tree Based Classification

Classification is a data mining (DM) technique used to predict or forecast unknown information from historical data. There are many classification techniques. ID3 is a very popular tree-based classification algorithm for categorical data that does not support continuous data. The attribute selection process plays a major role in building a classification tree model. Attribute Selection in...


A Discretization Algorithm for Uncertain Data

This paper proposes a new discretization algorithm for uncertain data. Uncertainty is widespread in real-world data. Numerous factors lead to data uncertainty, including data acquisition device error, approximate measurement, sampling fault, transmission latency, and data integration error. In many cases, estimating and modeling the uncertainty for underlying data is available and many ...


The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study

An empirical investigation of the interaction of sample size and discretization – in this case the entropy-based method CAIM (Class-Attribute Interdependence Maximization) – was undertaken to evaluate the impact and potential bias introduced into data mining performance metrics due to variation in sample size as it impacts the discretization process. Of particular interest was the effect of dis...


Experiments with Decision Tree Classifiers – Discretization of Numerical Attributes

Classification algorithms are used in numerous applications every day, from assigning letter grades to students' scores to computerized letter recognition in mail processing. Discretization consists of applying a set of rules to reduce the number of discrete intervals from which an attribute is assigned. Discretization is generally applied to datasets whose numerical range consists of c...



Publication date: 2001